Appendix C — Assignment 3

Instructions

  1. You may talk to a friend and discuss the questions and potential directions for solving them. However, you must write your own solutions and code independently, not as a group activity.

  2. Write your code in the Code cells and your answer in the Markdown cells of the Jupyter notebook. Ensure that the solution is written neatly enough to understand and grade.

  3. Use Quarto to print the .ipynb file as HTML. You will need to open the command prompt, navigate to the directory containing the file, and use the command: quarto render filename.ipynb --to html. Submit the HTML file.

  4. The assignment is worth 100 points, and is due on Friday, 2nd May 2025 at 11:59 pm.

  5. Five points are allotted for properly formatting the assignment. The breakdown is as follows:

  • Must be an HTML file rendered using Quarto (2 pts). If you have a Quarto issue, you must mention the issue & quote the error you get when rendering using Quarto in the comments section of Canvas, and submit the ipynb file. If your issue doesn’t seem genuine, you will lose points.
  • There aren’t excessively long outputs of extraneous information (e.g. no printouts of entire data frames without good reason, there aren’t long printouts of which iteration a loop is on, there aren’t long sections of commented-out code, etc.) (1 pt)
  • Final answers of each question are written in Markdown cells (1 pt).
  • There is no piece of unnecessary / redundant code, and no unnecessary / redundant text (1 pt)

C.1 1) Regression Problem - Miami housing

C.1.1 1a) Data preparation

Read the data miami-housing.csv. Check the description of the variables here. Split the data into 60% train and 40% test. Use random_state = 45. The response is SALE_PRC, and the rest of the columns are predictors, except PARCELNO. Print the shape of the predictors dataframe of the train data.

(2 points)
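A minimal sketch of the required split, using a small synthetic DataFrame in place of miami-housing.csv (the column names other than SALE_PRC and PARCELNO are illustrative, not the real schema):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for miami-housing.csv; real code would use pd.read_csv
df = pd.DataFrame({
    "PARCELNO": range(100),                                  # identifier, not a predictor
    "LND_SQFOOT": [5000 + 10 * i for i in range(100)],       # hypothetical predictor
    "TOT_LVG_AREA": [1500 + 5 * i for i in range(100)],      # hypothetical predictor
    "SALE_PRC": [200000 + 1000 * i for i in range(100)],     # response
})

# Response is SALE_PRC; drop it and the parcel identifier from the predictors
X = df.drop(columns=["SALE_PRC", "PARCELNO"])
y = df["SALE_PRC"]

# 60% train / 40% test with the required random_state
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=45
)
print(X_train.shape)  # shape of the training predictors
```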

C.1.2 1b) Baseline Decision Tree Model

Train a Decision Tree Regressor to predict SALE_PRC using all available predictors.

  • Use random_state=45 and keep all other hyperparameters at their default values.
  • After training the model, evaluate and report the following on both the training and test sets:
    • Mean Absolute Error (MAE)
    • R² Score

(3 points)
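The fit-and-evaluate pattern can be sketched as follows, on synthetic data standing in for the housing predictors. Note that a fully grown (default) tree fits the training set perfectly, so expect a large train/test gap:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error, r2_score

# Synthetic regression data in place of the Miami housing predictors
X, y = make_regression(n_samples=500, n_features=6, noise=20, random_state=45)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=45)

# Default hyperparameters except random_state, as the question requires
tree = DecisionTreeRegressor(random_state=45)
tree.fit(X_train, y_train)

# Report MAE and R² on both splits
for name, Xs, ys in [("train", X_train, y_train), ("test", X_test, y_test)]:
    pred = tree.predict(Xs)
    print(f"{name}: MAE={mean_absolute_error(ys, pred):.1f}, R2={r2_score(ys, pred):.3f}")
```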

C.1.3 1c) Tune the Decision Tree Model

Tune the hyperparameters of the Decision Tree Regressor developed in the previous question and evaluate its performance.

Your goal is to achieve a test set MAE (Mean Absolute Error) below $68,000.

  • You must display the optimal hyperparameter values obtained from the tuning process.
  • Compute and report the test MAE and R² Score using the tuned model.

Hints:

  1. BayesSearchCV() with max_depth and max_features can often complete in under a minute.
  2. You may use any hyperparameter tuning method (e.g., GridSearchCV, RandomizedSearchCV, BayesSearchCV).
  3. You are free to choose which hyperparameters to tune and to define your own search space.

(9 points)
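One possible tuning sketch using GridSearchCV (the grid values are illustrative, not the required search space; BayesSearchCV from scikit-optimize follows the same pattern):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=400, n_features=6, noise=20, random_state=45)

# Illustrative grid; choose your own hyperparameters and ranges
param_grid = {
    "max_depth": [4, 6, 8, 10, None],
    "max_features": [0.5, 0.75, 1.0],
}
search = GridSearchCV(
    DecisionTreeRegressor(random_state=45),
    param_grid,
    scoring="neg_mean_absolute_error",  # GridSearchCV maximizes, hence the negation
    cv=5,
)
search.fit(X, y)

print(search.best_params_)      # optimal hyperparameter values
print(-search.best_score_)      # cross-validated MAE of the best configuration
```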

C.1.4 1d) Bagged Decision Trees with Out-of-Bag Evaluation

Train a Bagging Regressor using Decision Trees as base estimators to predict SALE_PRC.

  • Use only the n_estimators hyperparameter for tuning; keep all other parameters at their default values.
  • Increase the number of trees (n_estimators) until the out-of-bag (OOB) MAE stabilizes.
  • Report the final OOB MAE, test MAE, and R² Score, and ensure that the OOB MAE is less than $48,000.

(4 points)

C.1.5 1e) Bagged Decision Trees Without Bootstrapping

Train a Bagging Regressor using Decision Trees, but this time disable bootstrapping by setting bootstrap=False.

  • Use the same n_estimators value as in the previous question.
  • Compute and report the following on the test set:
    • Mean Absolute Error (MAE)
    • R² Score

Explain why the test MAE in this case is:

  • Much higher than the MAE obtained when bootstrapping was enabled (previous question).
  • Lower than the MAE obtained from a single untuned decision tree (as in Question 1(b)).

💡 Hint: Consider the impact of bootstrap sampling on variance reduction and the benefits of aggregation in ensemble methods.

(2 points for code, 3 + 3 points for reasoning)

C.1.6 1f) Bagged Decision Trees with Feature Bootstrapping Only

Train a Bagging Regressor using Decision Trees, with the following configuration:

  • Disable sample bootstrapping by setting bootstrap=False
  • Enable feature bootstrapping by setting bootstrap_features=True

Use the same number of estimators (n_estimators) as in the previous bagging experiments.

  • Compute and report the following on the test set:

    • Mean Absolute Error (MAE)
    • R² Score

Explain why the test MAE obtained in this setting is much lower than the one in the previous question, where neither samples nor features were bootstrapped.

(2 points for code, 3 points for reasoning)

C.1.7 1g) Tuning a Bagged Tree Model

C.1.7.1 1g)i) Approaches

There are two common approaches for tuning a bagged tree model:

  1. Out-of-Bag (OOB) Prediction
  2. K-fold Cross-Validation using GridSearchCV

What is the advantage of each approach over the other? Specifically:

  • What is the advantage of the out-of-bag approach compared to K-fold cross-validation?
  • What is the advantage of K-fold cross-validation compared to the out-of-bag approach?

(3 + 3 points)

C.1.7.2 1g)ii) Tuning the hyperparameters

Tune the hyperparameters of the bagged tree model developed in 1(d). You may use either of the approaches mentioned in the previous question. Show the optimal values of the hyperparameters obtained. Compute the MAE and R² Score on test data with the tuned model. Your test MAE must be less than the test MAE obtained in the previous question.

It is up to you to pick the hyperparameters and their values in the grid.

Hint:

GridSearchCV() may work better than BayesSearchCV() in this case.

(9 points)
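A cross-validation sketch tuning ensemble-level hyperparameters of the Bagging Regressor (the grid is illustrative; you choose the hyperparameters and values):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=300, n_features=8, noise=20, random_state=45)

# Illustrative grid over ensemble-level hyperparameters
param_grid = {
    "n_estimators": [50, 100],
    "max_features": [0.6, 0.8, 1.0],   # fraction of features drawn per tree
}
search = GridSearchCV(
    BaggingRegressor(random_state=45),
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=3,
)
search.fit(X, y)

print(search.best_params_)   # optimal hyperparameter values
print(-search.best_score_)   # cross-validated MAE
```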

C.1.8 1h) Random Forest

C.1.8.1 1h)(i) Tuning a Random Forest Model

Train and tune a Random Forest Regressor to predict SALE_PRC.

  • Select hyperparameters and define your own tuning grid.
  • Use any tuning approach (e.g., Out-of-Bag (OOB) evaluation or K-fold cross-validation).
  • Report the following performance metrics on the test set:
    • Mean Absolute Error (MAE)
    • R² Score

✅ Your goal is to achieve a test MAE below $46,000.

Hint:
The OOB approach is efficient and can complete in under a minute.

(9 points)
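The OOB approach needs only one fit per candidate configuration, with no cross-validation loop. A sketch tuning max_features on synthetic data (the candidate values are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

X, y = make_regression(n_samples=400, n_features=8, noise=20, random_state=45)

# One fit per candidate; the OOB predictions act as an internal validation set
results = {}
for mf in [2, 4, 6, 8]:
    rf = RandomForestRegressor(
        n_estimators=200, max_features=mf, oob_score=True, random_state=45
    )
    rf.fit(X, y)
    results[mf] = mean_absolute_error(y, rf.oob_prediction_)  # OOB MAE

best_mf = min(results, key=results.get)
print(best_mf, round(results[best_mf], 1))
```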

C.1.8.2 1h)(ii) Feature Importance

After fitting the tuned Random Forest Regressor, extract and display the feature importances.

  • Print the predictors in decreasing order of importance based on the trained model.
  • This helps identify which features contribute most to predicting SALE_PRC.

(4 points)
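Extracting and sorting importances can be sketched as follows, with named synthetic predictors standing in for the Miami housing columns:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Named synthetic predictors; real code would use the housing column names
X, y = make_regression(n_samples=300, n_features=5, noise=10, random_state=45)
cols = [f"feature_{i}" for i in range(5)]

rf = RandomForestRegressor(n_estimators=100, random_state=45).fit(X, y)

# feature_importances_ sums to 1; sort to list predictors by importance
importances = pd.Series(rf.feature_importances_, index=cols).sort_values(ascending=False)
print(importances)
```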

C.1.8.3 1h)(iii) Feature Selection

Drop the least important predictor identified in the previous step, and re-train the tuned Random Forest model.

  • Compute the test MAE and R² Score after dropping the feature.
  • You may need to adjust the max_features hyperparameter to reflect the reduced number of predictors.
  • Compare the new test MAE with the previous one.

❓ Did the test MAE decrease after removing the least important feature?

(4 points)

C.1.8.4 1h)(iv) Random Forest vs. Bagging: max_features

The max_features hyperparameter is available in both RandomForestRegressor() and BaggingRegressor().

Does max_features have the same meaning in both models?
If not, explain the difference in how it is interpreted and applied.

💡 Hint: Refer to the scikit-learn documentation for both estimators to understand how max_features affects feature selection during training.

(1 + 3 points)

C.2 2) Classification - Term deposit

The data for this question is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls, where bank clients were called to subscribe to a term deposit.

There is a train data - train.csv, which you will use to develop a model. There is a test data - test.csv, which you will use to test your model. Each dataset has the following attributes about the clients called in the marketing campaign:

  1. age: Age of the client

  2. education: Education level of the client

  3. day: Day of the month the call is made

  4. month: Month of the call

  5. y: Did the client subscribe to a term deposit?

  6. duration: Call duration, in seconds. This attribute highly affects the output target (e.g., if duration=0 then y=‘no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for inference purposes and should be discarded if the intention is to have a realistic predictive model.

(Raw data source: Source. Do not use the raw data source for this assignment. It is just for reference.)

C.2.1 2a) Data Preparation

Begin by examining the distribution of the target variable in both the training and test sets. This will help you assess whether there is any significant class imbalance.

Next, consider the two available approaches for hyperparameter tuning:

  • Cross-validation (CV)
  • Out-of-bag (OOB) evaluation

C.2.1.1 ❓ Which method do you prefer for this dataset, and why?

Discuss your choice based on:

  • The size of the dataset
  • The class imbalance in the target variable
  • The reliability and interpretability of each method
  • Whether you need stratified sampling to preserve class distribution during evaluation

(2 points)
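Checking the class distribution can be sketched as follows, with a synthetic stand-in for train.csv (the real file would be read with pd.read_csv, and the same check repeated for test.csv):

```python
import pandas as pd

# Synthetic stand-in for train.csv with an imbalanced target
train = pd.DataFrame({"y": ["no"] * 880 + ["yes"] * 120})

# Class distribution as proportions -- reveals any imbalance in the target
dist = train["y"].value_counts(normalize=True)
print(dist)
```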

C.2.2 2b) Random Forest for Term Deposit Subscription Prediction

Develop and tune a Random Forest Classifier to predict whether a client will subscribe to a term deposit using the following predictors:

  • age
  • education
  • day
  • month

The model must satisfy the following performance criteria:

C.2.2.1 ✅ Requirements:

  1. Minimum overall classification accuracy of 75%, across both train.csv and test.csv.
  2. Minimum recall of 60%, across both train.csv and test.csv.

You must:

  • Print the accuracy and recall for both datasets (train.csv and test.csv).
  • Use cross-validation on the training data to optimize the model hyperparameters.
  • Select a threshold probability for classification and apply it consistently across both datasets.

C.2.2.2 ⚠️ Important Notes:

  1. Do not use duration as a predictor. Its value is determined after the marketing call ends, so using it would leak information about the outcome.

  2. You are free to choose any decision threshold for classification, but the same threshold must be used consistently for both training and test evaluation.

  3. Use cross-validation to tune hyperparameters such as max_features, max_depth, and max_leaf_nodes.
    - You may use StratifiedKFold or any appropriate CV method that respects class imbalance.

  4. After tuning the model, plot cross-validated accuracy and recall across a range of threshold values (e.g., 0.1 to 0.9). Use this plot to select a threshold that satisfies the required trade-off between accuracy and recall.

  5. Evaluate the final tuned model (with the chosen threshold) on the test dataset. Do not use the test data to guide any part of the tuning or threshold selection.

C.2.2.3 💡 Hints:

  • Restrict the search space to:
    • max_depth ≤ 25
    • max_leaf_nodes ≤ 45
      These limits encourage generalization and help balance recall and accuracy.
  • Consider using cross-validation scores to compute predicted probabilities when plotting recall/accuracy curves.

C.2.2.4 📝 Scoring Breakdown (22 points total):

  • 8 points – Hyperparameter tuning via cross-validation
  • 5 points – Plotting accuracy and recall across thresholds
  • 5 points – Threshold selection based on the plot
  • 4 points – Reporting accuracy and recall on both datasets
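The threshold-selection step (note 4 above) can be sketched with cross_val_predict, which scores every training sample with a model that never saw it, so the accuracy/recall curves are not optimistically biased. Synthetic imbalanced data stands in for the bank-marketing predictors:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict, StratifiedKFold
from sklearn.metrics import accuracy_score, recall_score

# Imbalanced synthetic data standing in for the bank-marketing features
X, y = make_classification(n_samples=1000, weights=[0.88], random_state=45)

# Illustrative hyperparameters; in the assignment these come from tuning
rf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=45)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=45)

# Cross-validated probabilities of the positive class
proba = cross_val_predict(rf, X, y, cv=cv, method="predict_proba")[:, 1]

# Sweep thresholds; in the assignment, plot these curves and pick a threshold
for thr in np.arange(0.1, 0.91, 0.2):
    pred = (proba >= thr).astype(int)
    print(f"thr={thr:.1f} acc={accuracy_score(y, pred):.3f} rec={recall_score(y, pred):.3f}")
```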

C.3 3) Predictor Transformations in Trees

Can a non-linear monotonic transformation of predictors (such as log(), sqrt(), etc.) be useful in improving the accuracy of decision tree models?

Provide a brief explanation based on your understanding of how decision trees split data and handle predictor scales.

(4 points for answer)